1 Chapter 1: Introduction


Seattle has one of the “hottest” housing markets in the United States. As of July 2018, Seattle had “led the nation in home price gains” for 21 straight months, a streak topped only by Portland in the 1990s; the trend was driven by the city’s tech sector and by supply falling short of demand (https://bit.ly/2v5UMcn). Given the Seattle market’s notoriety for high prices, we wanted to explore which variables affect housing price in this market. With this goal in mind, we found a public dataset on Kaggle (“House Sales in King County, USA”) offering 21,613 observations across 21 variables. According to Kaggle, it “includes homes sold between May 2014 and May 2015.” Although the dataset doesn’t cover macro-level variables affecting housing price (such as the local job market or Amazon’s presence), it does cover micro-level variables, such as renovations, number of bedrooms, and square feet of living space, that are common to virtually all housing markets in the United States. As a result, our analysis could lay the groundwork for future comparative analysis with other housing markets across the country.

This report is organized as follows:

  1. Description of the Data (explanation of the dataset and its variables)
  2. Geographic Coverage of Data
  3. House Prices vs. Size
  4. Factors Analysis
  5. Multiple Linear Regression Model
  6. Conclusion

2 Chapter 2: Description of the Data

2.1 Source Data


As mentioned previously, our dataset contains 21,613 observations across 21 variables. (See below for a readout of the dataset’s structure and variable names.) The variable descriptions that follow come from https://bit.ly/2MsyRFl; asterisks next to a variable name indicate that we used it in our analysis:

## 'data.frame':    21613 obs. of  21 variables:
##  $ id           : num  7129300520 6414100192 5631500400 2487200875 1954400510 ...
##  $ date         : Factor w/ 372 levels "20140502T000000",..: 165 221 291 221 284 11 57 252 340 306 ...
##  $ price        : num  221900 538000 180000 604000 510000 ...
##  $ bedrooms     : int  3 3 2 4 3 4 3 3 3 3 ...
##  $ bathrooms    : num  1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
##  $ sqft_living  : int  1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
##  $ sqft_lot     : int  5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
##  $ floors       : num  1 2 1 1 1 1 2 1 1 2 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 3 3 5 3 3 3 3 3 3 ...
##  $ grade        : int  7 7 6 7 8 11 7 7 7 7 ...
##  $ sqft_above   : int  1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
##  $ sqft_basement: int  0 400 0 910 0 1530 0 0 730 0 ...
##  $ yr_built     : int  1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
##  $ yr_renovated : int  0 1991 0 0 0 0 0 0 0 0 ...
##  $ zipcode      : int  98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
##  $ lat          : num  47.5 47.7 47.7 47.5 47.6 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
##  $ sqft_lot15   : int  5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
  1. Id (unique ID for each home sold)
  2. Date (date of the home sale)
  3. Price (price of each home sold)***
  4. Bedrooms (number of bedrooms)***
  5. Bathrooms (number of bathrooms, where .5 accounts for a room with a toilet but no shower)
  6. Sqft_living (square footage of the home’s interior living space)***
  7. Sqft_lot (square footage of the land space)***
  8. Floors (number of floors)
  9. Waterfront (a dummy variable for whether the home overlooks the waterfront; 0 represents no waterfront)
  10. View (an index from 0 to 4 of how the view of the property was)
  11. Condition (an index from 1 to 5 on the condition of the home, with the lowest number representing poor condition)***
  12. Grade (an index from 1 to 13, with the lowest number representing poor construction and design)***
  13. Sqft_above (the square footage of the interior housing space that is above ground level)
  14. Sqft_basement (the square footage of the interior housing space that is below ground level)
  15. yr_built (the year the house was initially built)***
  16. yr_renovated (the year of the house’s last renovation)***
  17. zipcode (what zipcode area the house is in)***
  18. Lat (latitude)***
  19. Long (longitude)***
  20. Sqft_living15 (the square footage of the interior housing living space for the nearest 15 neighbors)
  21. Sqft_lot15 (the square footage of the land lots of the nearest 15 neighbors)

For our exploratory data analysis, we ignored “Id” and “Date” because we did not expect a sale’s identifier or date to relate to price. We also ignored “floors” because it can be considered a proxy for sqft_living. “Waterfront” and “View” were dropped because the vast majority of properties were coded as “0”. We ignored sqft_basement and sqft_above because they are components of “sqft_living” (we didn’t want redundancy in our analysis). We also ignored “sqft_living15” and “sqft_lot15” because we were interested only in the attributes of individual houses, not those of their surrounding neighborhoods (although that could make for an interesting follow-up study).

Following these decisions, we cleaned the data accordingly: we dropped “waterfront” and “view”; we subsetted the dataset to include only properties with more than 0 bedrooms and bathrooms (we considered properties without them “outliers”); we subsetted the dataset to include only properties with fewer than 30 bedrooms (a higher count is most likely a recording error, given how small those houses are in sqft); we dropped “NA” values to simplify our analysis (“NA” values are hard to perform operations on); we converted “condition” and “grade” into factor variables because they are effectively ordered categories rather than continuous measurements; and we applied a logarithmic transformation to housing price to make the visualizations easier to read.
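The cleaning steps above can be sketched in base R as follows. This is a minimal illustration on a small made-up data frame, not our actual script: the object names `toy` and `clean`, the `log_price` column, and all of the values are assumptions; only the other column names come from the dataset.

```r
# Toy stand-in for the Kaggle data (values are made up for illustration).
toy <- data.frame(
  price      = c(221900, 538000, 180000, 604000, 510000, NA),
  bedrooms   = c(3, 0, 2, 33, 4, 3),
  bathrooms  = c(1, 2.25, 0, 1.75, 2, 1),
  waterfront = c(0, 0, 0, 0, 0, 0),
  view       = c(0, 0, 0, 0, 0, 0),
  condition  = c(3, 3, 3, 5, 3, 4),
  grade      = c(7, 7, 6, 7, 8, 7)
)

# Drop the two mostly-zero columns.
clean <- toy[, !(names(toy) %in% c("waterfront", "view"))]

# Keep homes with at least one bedroom and bathroom, and fewer than 30 bedrooms.
clean <- subset(clean, bedrooms > 0 & bathrooms > 0 & bedrooms < 30)

# Remove rows with missing values.
clean <- na.omit(clean)

# Treat the two quality indices as categorical, and add a log-price column.
clean$condition <- factor(clean$condition)
clean$grade     <- factor(clean$grade)
clean$log_price <- log(clean$price)

str(clean)
```

In this toy example, four of the six rows are removed: one for each of the three filters, plus one for the missing price.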

2.2 Geographic Coverage of Data

Below is a visualization of the points in the dataset by price, plotted with the leaflet library. Note that the data have been divided into unequal bins to provide a better visualization of the distribution of housing price, so please read the legend carefully. More expensive houses tend to be concentrated near the water and the center of the city.

Here instead is a visualization of the observations by property lot sqft. Again, the data have been divided into unequal bins, this time to provide a better visualization of the distribution of lot size, so please read the legend carefully. Our observation follows common sense: the further one ventures outside the city center, the more land there is.

3 Chapter 3: House Prices and Sizes

3.1 House Price


A brief overview of the dataset yields the following observations for housing price: the minimum price is $78,000, while the maximum is $7,700,000 (quite a large range); the mean is $540,198, well above the median of $450,000 (indicating that the distribution is right-skewed, as further indicated by the histogram below); the standard deviation is $367,142; and the variance is 134,792,956,735 (quite large, indicating “that the data points are very spread out from the mean, and from one another” (https://bit.ly/2MZ1cCn)).
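As a small illustration of why a mean well above the median points to right skew, and of how the variance relates to the standard deviation, here is a sketch on made-up prices. The values below are assumptions chosen for the example and are not taken from the dataset.

```r
# Made-up prices with one very expensive home (not values from the dataset).
price <- c(200000, 300000, 350000, 400000, 450000, 500000, 600000, 3000000)

mean(price)    # pulled upward by the single very expensive home
median(price)  # much less affected by the extreme value
sd(price)      # in dollars
var(price)     # equals sd(price)^2, so it is in squared dollars

# A mean above the median is a common hint of right skew.
mean_above_median <- mean(price) > median(price)
mean_above_median
```

Because the variance is the square of the standard deviation, it is expressed in squared dollars, which is one reason it looks so large next to the other summary statistics.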

Just for context, the following readouts offer cross sections of Seattle’s most expensive houses; average prices for each condition level; and average prices for each grade level.

3.1.1 What do the most expensive houses look like?

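From slicing the data, it looks like the most expensive houses are very well constructed, have tens of thousands of square feet of property, and have 5 or more bedrooms. The two readouts below show the same five houses: the first in dataset order, the second sorted by price from highest to lowest.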

##           id            date   price bedrooms bathrooms sqft_living
## 1 8907500070 20150413T000000 5350000        5      5.00        8000
## 2 9808700762 20140611T000000 7062500        5      4.50       10040
## 3 2470100110 20140804T000000 5570000        5      5.75        9200
## 4 6762700020 20141013T000000 7700000        6      8.00       12050
## 5 9208900037 20140919T000000 6885000        6      7.75        9890
##   sqft_lot floors condition grade sqft_above sqft_basement yr_built
## 1    23985    2.0         3    12       6720          1280     2009
## 2    37325    2.0         3    11       7680          2360     1940
## 3    35069    2.0         3    13       6200          3000     2001
## 4    27600    2.5         4    13       8570          3480     1910
## 5    31374    2.0         3    13       8860          1030     2001
##   yr_renovated zipcode  lat long sqft_living15 sqft_lot15
## 1            0   98004 47.6 -122          4600      21750
## 2         2001   98004 47.6 -122          3930      25449
## 3            0   98039 47.6 -122          3560      24345
## 4         1987   98102 47.6 -122          3940       8800
## 5            0   98039 47.6 -122          4540      42730
##           id            date   price bedrooms bathrooms sqft_living
## 1 6762700020 20141013T000000 7700000        6      8.00       12050
## 2 9808700762 20140611T000000 7062500        5      4.50       10040
## 3 9208900037 20140919T000000 6885000        6      7.75        9890
## 4 2470100110 20140804T000000 5570000        5      5.75        9200
## 5 8907500070 20150413T000000 5350000        5      5.00        8000
##   sqft_lot floors condition grade sqft_above sqft_basement yr_built
## 1    27600    2.5         4    13       8570          3480     1910
## 2    37325    2.0         3    11       7680          2360     1940
## 3    31374    2.0         3    13       8860          1030     2001
## 4    35069    2.0         3    13       6200          3000     2001
## 5    23985    2.0         3    12       6720          1280     2009
##   yr_renovated zipcode  lat long sqft_living15 sqft_lot15
## 1         1987   98102 47.6 -122          3940       8800
## 2         2001   98004 47.6 -122          3930      25449
## 3            0   98039 47.6 -122          4540      42730
## 4            0   98039 47.6 -122          3560      24345
## 5            0   98004 47.6 -122          4600      21750

3.1.2 What is the average price for each condition level?

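From the slice below, average prices trend upward with condition overall, though not at every step: levels 2 and 4 average slightly less than levels 1 and 3, respectively. All of the averages are in the hundreds of thousands of dollars.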

##   condition  price
## 1         1 341067
## 2         2 328149
## 3         3 542089
## 4         4 521274
## 5         5 612402
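
A table of group averages like the one above can be produced with `aggregate()` from base R. The sketch below uses a made-up data frame (`toy` and its values are assumptions) rather than the real dataset.

```r
# Made-up data: two homes in each of three condition levels.
toy <- data.frame(
  condition = c(1, 1, 3, 3, 5, 5),
  price     = c(300000, 380000, 500000, 580000, 600000, 640000)
)

# Mean price within each condition level.
avg_by_condition <- aggregate(price ~ condition, data = toy, FUN = mean)
avg_by_condition
```

The same call with `grade` in place of `condition` would give the corresponding table for grade levels.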

3.1.3 What is the average price for each grade level?

From the slice below, we can see that price generally trends upward along with grade.

##    grade   price
## 1      3  262000
## 2      4  212002
## 3      5  248524
## 4      6  301920
## 5      7  402566
## 6      8  542944
## 7      9  773513
## 8     10 1071771
## 9     11 1496842
## 10    12 2201285
## 11    13 3709615

3.2 House Size


Below we have included histograms for “bedrooms”, “sqft_living”, and “sqft_lot”. Upon inspecting the graphs, it becomes clear that most of the properties in this dataset have around 3 bedrooms and around 1,000-2,000 square feet of living space (for reference, in 2015, the average US house size was around 2,600 square feet (https://bit.ly/32zY9Hi)); as for sqft_lot, most of the properties have between 5,000 and 10,000 square feet of land. As with the housing price histogram shown earlier, these histograms are right-skewed.

3.2.1 What do the largest houses look like?


Going into this project, we hypothesized that larger houses would be priced higher than smaller houses. House size is determined in large part by “sqft_living”, of which “bathrooms” and “bedrooms” are a part.

It is apparent here that the largest houses are also among the most expensive: they are all priced in the millions of dollars, which makes them outliers relative to the dataset as a whole.

##           id            date   price bedrooms bathrooms sqft_living
## 1 9808700762 20140611T000000 7062500        5      4.50       10040
## 2 6762700020 20141013T000000 7700000        6      8.00       12050
## 3 1924059029 20140617T000000 4668000        5      6.75        9640
## 4 9208900037 20140919T000000 6885000        6      7.75        9890
## 5 1225069038 20140505T000000 2280000        7      8.00       13540
##   sqft_lot floors condition grade sqft_above sqft_basement yr_built
## 1    37325    2.0         3    11       7680          2360     1940
## 2    27600    2.5         4    13       8570          3480     1910
## 3    13068    1.0         3    12       4820          4820     1983
## 4    31374    2.0         3    13       8860          1030     2001
## 5   307752    3.0         3    12       9410          4130     1999
##   yr_renovated zipcode  lat long sqft_living15 sqft_lot15
## 1         2001   98004 47.6 -122          3930      25449
## 2         1987   98102 47.6 -122          3940       8800
## 3         2009   98040 47.6 -122          3270      10454
## 4            0   98039 47.6 -122          4540      42730
## 5            0   98053 47.7 -122          4850     217800

The properties with the largest amount of land are also priced highly, but not as highly as those listed in the “sqft_living” readout. This could suggest a lower correlation between housing price and sqft_lot than that between housing price and sqft_living.

##           id            date  price bedrooms bathrooms sqft_living
## 1 1020069017 20150327T000000 700000        4      1.00        1300
## 2 3326079016 20150504T000000 190000        2      1.00         710
## 3 2623069031 20140521T000000 542500        5      3.25        3010
## 4 2323089009 20150119T000000 855000        4      3.50        4030
## 5  722069232 20140905T000000 998000        4      3.25        3770
##   sqft_lot floors condition grade sqft_above sqft_basement yr_built
## 1  1651359    1.0         4     6       1300             0     1920
## 2  1164794    1.0         2     5        710             0     1915
## 3  1074218    1.5         5     8       2010          1000     1931
## 4  1024068    2.0         3    10       4030             0     2006
## 5   982998    2.0         3    10       3770             0     1992
##   yr_renovated zipcode  lat long sqft_living15 sqft_lot15
## 1            0   98022 47.2 -122          2560     425581
## 2            0   98014 47.7 -122          1680      16730
## 3            0   98027 47.5 -122          2450      68825
## 4            0   98045 47.5 -122          1830      11700
## 5            0   98058 47.4 -122          2290      37141

The properties with the largest number of bedrooms are also priced highly (around or above the dataset mean of $540,000), but not quite as highly as those in the “sqft_living” readout. Since “bedrooms” contributes in part – but not in whole – to sqft_living, it makes sense that its correlation with housing price is lower than that of sqft_living.

##           id            date   price bedrooms bathrooms sqft_living
## 1 1773100755 20140821T000000  520000       11      3.00        3000
## 2  627300145 20140814T000000 1148000       10      5.25        4590
## 3 5566100170 20141029T000000  650000       10      2.00        3610
## 4 8812401450 20141229T000000  660000       10      3.00        2920
##   sqft_lot floors condition grade sqft_above sqft_basement yr_built
## 1     4960      2         3     7       2400           600     1918
## 2    10920      1         3     9       2500          2090     2008
## 3    11914      2         4     7       3010           600     1958
## 4     3745      2         4     7       1860          1060     1913
##   yr_renovated zipcode  lat long sqft_living15 sqft_lot15
## 1         1999   98106 47.6 -122          1420       4960
## 2            0   98004 47.6 -122          2730      10400
## 3            0   98006 47.6 -122          2040      11914
## 4            0   98105 47.7 -122          1810       3745

4 Chapter 4: Factors Analysis

4.1 SMART Question

Are houses of different sizes priced differently?

Now that we’ve taken a look at slices of the data, we can delve deeper with some graphs. Below are a scatterplot and a boxplot of housing price vs. “sqft_living”.

4.1.1 Comparison of sqft living with price

From the scatterplot, it’s apparent that there is a relatively strong, positive correlation between housing price and living space (.70192, to be exact). That is, as living space increases, so does housing price. Note that a majority of the data points lie below 6,000 sqft, and below $2 million.

Now let’s take a look at the same data with a boxplot; this time, we have “sqft_living” categorized into 5 intervals. From this visualization as well, it’s apparent that “sqft_living” correlates positively with housing price. The last interval (10,891-13,540 sqft) seems to buck this trend, but it is worth noting that only 2 houses are part of this group; such a small group could explain the discrepancy.
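The two steps described above, computing the correlation and then binning living space into 5 intervals, can be sketched as follows. The data are simulated, and the use of equal-width bins via `cut(..., breaks = 5)` is an assumption about how such intervals might be built, not a record of our exact code.

```r
set.seed(1)  # simulated data, for reproducibility only
sqft_living <- round(runif(200, min = 400, max = 6000))
price       <- 150 * sqft_living + rnorm(200, sd = 100000)

# Pearson correlation between living space and price.
r <- cor(sqft_living, price)
r

# Split living space into 5 equal-width intervals.
sqft_bins <- cut(sqft_living, breaks = 5, dig.lab = 5)

# Number of houses and median price per interval.
bin_counts  <- table(sqft_bins)
bin_medians <- tapply(price, sqft_bins, median)
bin_counts
bin_medians
```

Checking the counts per interval is what reveals sparsely populated bins, whose boxes should be read with caution.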


4.1.2 Comparison of sqft lot with price

Next up in our exploratory data analysis is housing price vs. “sqft_lot”. How does land area correlate with housing price? According to our scatterplot, not very highly – there is a positive correlation of only .08988. This seems to suggest that sqft_lot is more weakly related to housing price than sqft_living. Indeed, the vast majority of data points in the scatterplot seem to trend upward in price with relatively small increases in land area.

Next, let’s take a look at the same data in a boxplot. Unfortunately, the visualization isn’t very readable; let’s apply a logarithmic transformation to housing price to improve our y-axis scale.

The modified boxplot below (with the logarithmic scale) is much easier to interpret. We can see that housing price increases as land area increases, but only to an extent. Note that houses in the 991,000-1.3M sqft and 1.3M-1.65M sqft ranges appear to buck the trend. Once again, this can be explained by the fact that only a few houses are part of these two intervals – only 4 to be exact.

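The studentized Breusch-Pagan test below gives no evidence of unequal variance across the five lot-size groups (BP = 0.2, df = 4, p-value = 1), so we proceed with a one-way ANOVA. The ANOVA indicates that mean price differs across the groups (F = 7.24, p-value = 0.0000081). According to the Tukey comparisons, however, only the 660K-991K group differs significantly from the two smallest groups (adjusted p-values of 0.000 and 0.005); no other pairwise difference is significant at the 5% level.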

## 
##  studentized Breusch-Pagan test
## 
## data:  kc_house_data$price ~ sqft.lot
## BP = 0.2, df = 4, p-value = 1
##                Df           Sum Sq      Mean Sq F value    Pr(>F)    
## sqft.lot        4    3896930121642 974232530411    7.24 0.0000081 ***
## Residuals   21591 2906956970577216 134637440164                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = kc_house_data$price ~ sqft.lot)
## 
## $sqft.lot
##                         diff      lwr     upr p adj
## 330K-660K-520-330K    130987   -15181  277155 0.104
## 660K-991K-520-330K    619510   265542  973477 0.000
## 991K-1.3M-520-330K    -10511  -588471  567449 1.000
## 1.3M-1.65M-520-330K   160322  -840687 1161331 0.992
## 660K-991K-330K-660K   488523   105684  871361 0.005
## 991K-1.3M-330K-660K  -141498  -737577  454580 0.967
## 1.3M-1.65M-330K-660K   29335  -982244 1040914 1.000
## 991K-1.3M-660K-991K  -630021 -1307691   47650 0.083
## 1.3M-1.65M-660K-991K -459187 -1520893  602518 0.763
## 1.3M-1.65M-991K-1.3M  170833  -985006 1326672 0.994
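
The sequence of tests shown above (a variance check, then a one-way ANOVA, then Tukey comparisons) can be sketched in base R. The header of the first readout matches what `bptest()` from the lmtest package prints; to stay dependency-free, the sketch below instead computes the studentized (Koenker) Breusch-Pagan statistic by hand as n times the R-squared of a regression of the squared residuals on the predictor. The data, group labels, and group sizes are all made up.

```r
set.seed(42)  # simulated data; group labels and sizes are assumptions
lot_group <- factor(rep(c("small", "medium", "large"), each = 50),
                    levels = c("small", "medium", "large"))
price <- c(rnorm(50, mean = 450000, sd = 80000),
           rnorm(50, mean = 500000, sd = 80000),
           rnorm(50, mean = 650000, sd = 80000))

# Studentized Breusch-Pagan statistic computed by hand:
# regress the squared residuals on the same predictor and use n * R^2.
fit  <- lm(price ~ lot_group)
aux  <- lm(residuals(fit)^2 ~ lot_group)
bp   <- length(price) * summary(aux)$r.squared
bp_p <- pchisq(bp, df = nlevels(lot_group) - 1, lower.tail = FALSE)
c(BP = bp, p.value = bp_p)

# One-way ANOVA, then Tukey's HSD for all pairwise group differences.
anova_fit <- aov(price ~ lot_group)
summary(anova_fit)
tukey <- TukeyHSD(anova_fit)
tukey
```

A large Breusch-Pagan p-value means there is no evidence against equal variances, which is the condition under which the ANOVA and Tukey results are easiest to trust.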

4.1.3 Comparison of number of bedrooms with price

Here, we have a logarithmic boxplot of housing price vs. “bedrooms”. There appears to be a clear trend: as the number of bedrooms increases, housing price increases as well. The 9-11 interval bucks the trend slightly, but again, this can be explained by the fact that only 10 houses are part of this interval, compared with 21,586 in total for the others.

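For the bedroom groups, the studentized Breusch-Pagan test below rejects equal variance (BP = 202, df = 3, p-value < 0.0000000000000002), so the equal-variance assumption behind a standard one-way ANOVA does not hold and we do not run one here.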

## 
##  studentized Breusch-Pagan test
## 
## data:  kc_house_data$price ~ number.bedrooms
## BP = 202, df = 3, p-value <0.0000000000000002

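Instead, we use chi-squared tests of independence on categorized data, with the null hypothesis that the number of rooms is independent of price. Both tests below, on the tables bed_p and bath_p, reject independence (p-value < 0.0000000000000002 in each case). Note R’s warning that the chi-squared approximation may be incorrect, which typically arises when some cells of the table have small expected counts.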

## Warning in chisq.test(bed_p): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  bed_p
## X-squared = 2287, df = 21, p-value <0.0000000000000002
## Warning in chisq.test(bath_p): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  bath_p
## X-squared = 4534, df = 21, p-value <0.0000000000000002
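
A chi-squared test of independence of this kind can be sketched as follows. The data are simulated, and binning price into quartile-based categories is an assumption made for the example; the tables behind the readouts above may have been built differently.

```r
set.seed(7)  # simulated data
bedrooms <- sample(1:5, size = 500, replace = TRUE)
price    <- 200000 + 80000 * bedrooms + rnorm(500, sd = 60000)

# Turn price into 4 ordered categories (quartile-based bins are an assumption).
price_cat <- cut(price,
                 breaks = quantile(price, probs = seq(0, 1, by = 0.25)),
                 include.lowest = TRUE)

# Contingency table of bedroom count by price category.
bed_tbl <- table(bedrooms, price_cat)

# H0: bedroom count and price category are independent.
result <- chisq.test(bed_tbl)
result$parameter      # degrees of freedom: (rows - 1) * (columns - 1)
result$p.value
min(result$expected)  # very small expected counts are what trigger R's warning
```

Inspecting `result$expected` is a quick way to see whether the warning about the approximation matters: sparse cells can often be fixed by merging rare categories.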

4.2 SMART Question

Are houses of different quality priced differently?

4.2.1 Comparison of condition with price


Here, we compare “condition” with housing price. Once again, “condition” represents an index from 1 to 5, with the lowest number representing poor condition. Once we take a look at the boxplots below (the second one uses log-transformed price for clearer visualization), it becomes clear that house condition correlates positively with housing price.

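The studentized Breusch-Pagan test below gives no evidence of unequal variance across the condition levels (BP = 5, df = 4, p-value = 0.3), so we run a one-way ANOVA, which shows that mean price differs across condition levels (F = 36.9, p-value < 0.0000000000000002). In the Tukey comparisons, every pair of levels differs significantly at the 5% level except 2 vs. 1 (adjusted p-value of 1.000) and 4 vs. 1 (0.062). Note that level 4 averages significantly less than level 3 (a difference of -20,815), so the relationship is not strictly increasing.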

## 
##  studentized Breusch-Pagan test
## 
## data:  price ~ condition
## BP = 5, df = 4, p-value = 0.3
##                            Df           Sum Sq       Mean Sq F value
## kc_house_data$condition     4   19739857713567 4934964428392    36.9
## Residuals               21591 2891114042985296  133903665554        
##                                      Pr(>F)    
## kc_house_data$condition <0.0000000000000002 ***
## Residuals                                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = kc_house_data$price ~ kc_house_data$condition)
## 
## $`kc_house_data$condition`
##       diff     lwr    upr p adj
## 2-1 -12918 -213478 187642 1.000
## 3-1 201022   15459 386585 0.026
## 4-1 180207   -5637 366051 0.062
## 5-1 271335   84389 458280 0.001
## 3-2 213940  136914 290965 0.000
## 4-2 193125  115424 270825 0.000
## 5-2 284253  203953 364552 0.000
## 4-3 -20815  -36519  -5111 0.003
## 5-3  70313   44676  95950 0.000
## 5-4  91128   63529 118727 0.000

4.2.2 Comparison of grade with price


Here we have a boxplot comparing “grade” with housing price. Once again, “grade” represents an index from 1 to 13, with the lowest number representing poor construction and design. The trend is clear: construction and design grade correlate positively with housing price.

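We did not run an ANOVA for grade. The chi-squared test below on cond.tbl rejects independence (X-squared = 1457, df = 40, p-value < 0.0000000000000002), again with R’s warning that the approximation may be incorrect.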

## Warning in chisq.test(cond.tbl): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  cond.tbl
## X-squared = 1457, df = 40, p-value <0.0000000000000002

4.3 SMART Question

Are older houses priced differently?

4.3.1 Comparison of year they were built with price

Here, we have a comparison of “yr_built” with housing price. Once we take a look at the logarithmic boxplot, we see no obvious trends. Housing price trends downward from 1900-1969, and then picks back up from 1970-2015. What might explain this? Well, “yr_built” does not take “yr_renovated” into account. For instance, two otherwise equivalent houses built in the same year could have different prices, depending on whether one has been renovated while the other hasn’t.

Let’s construct the same boxplots, but this time indexed by “yr_renovated”.

Unfortunately, the vast majority of properties in this dataset have never been renovated (20,682 to be exact). This means that only 914 properties have been renovated. This makes the resulting boxplots somewhat uninformative: the larger population boxplot (not renovated) largely mirrors the patterns of the previous graph, and the smaller population graph (renovated) is based on a population too small to run meaningful analysis on. We have included the graphs here to showcase our thought process, but we are well aware of their limitations.

Here, we’ve graphed housing price by yr_renovated itself. This graph also showcases only 914 properties – the ones that were renovated. Generally speaking, as “yr_renovated” approaches the present day, price increases. The exception is between the 1924-1946 and 1947-1969 intervals; note however, that only 9 properties occupy the first interval.

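We did not run an ANOVA here: the studentized Breusch-Pagan test below rejects equal variance across the year-built groups (BP = 41, df = 4, p-value = 0.00000003).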

## 
##  studentized Breusch-Pagan test
## 
## data:  kc_house_data$price ~ year.built
## BP = 41, df = 4, p-value = 0.00000003


4.3.2 Comparison of year renovated with price

Still to be added for this comparison: the charts, an ANOVA, and a comparison of price changes between year renovated and year built.

4.4 SMART Question

Is quality affected by house age?

4.4.1 Chi-square test for year they were built and grade

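We use chi-squared tests to assess whether house age is independent of quality. Both tests below, on gradey.tbl and condy.tbl, reject independence (p-value < 0.0000000000000002 in each case), although R again warns that the approximation may be incorrect.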

## Warning in chisq.test(gradey.tbl): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  gradey.tbl
## X-squared = 6404, df = 40, p-value <0.0000000000000002
## Warning in chisq.test(condy.tbl): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  condy.tbl
## X-squared = 4844, df = 16, p-value <0.0000000000000002

4.4.2 Does quality get better when renovated?

Planned analysis: replicate the chi-squared tests above using year renovated, to assess whether renovation and quality are independent; then subset the older houses that have been renovated and rerun both tests to see how the results change over time.

Open questions to explore: Do older renovated houses sell at typical prices? Are median-priced houses mostly not renovated? How do condition and grade compare among older houses, older renovated houses, and newer houses? How does price change across these groups?

5 Chapter 5: Multiple Linear Regression Model

5.1 SMART Question:

What factors influence the house price the most?

5.1.1 LSRL Model building

Below is our regression model, along with a comprehensive correlation plot.

5.1.1.1 First, take a look at all numeric variables and their correlation.

yr_built and sqft_lot seem unrelated to price, as their correlation coefficients with price are almost 0; accordingly, we do not use them as independent variables to predict house price.

## 
## Call:
## lm(formula = price ~ . - sqft_lot - yr_built, data = h2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1706400  -142615   -21697   101727  4133425 
## 
## Coefficients: (1 not defined because of singularities)
##                Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)    80549.20    7851.17   10.26 < 0.0000000000000002 ***
## bedrooms      -64528.79    2446.40  -26.38 < 0.0000000000000002 ***
## bathrooms       5499.45    3852.17    1.43              0.15341    
## sqft_living      340.68       4.99   68.33 < 0.0000000000000002 ***
## floors         14875.02    4299.43    3.46              0.00054 ***
## sqft_above       -36.54       5.03   -7.26      0.0000000000004 ***
## sqft_basement        NA         NA      NA                   NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 257000 on 21590 degrees of freedom
## Multiple R-squared:  0.51,   Adjusted R-squared:  0.509 
## F-statistic: 4.49e+03 on 5 and 21590 DF,  p-value: <0.0000000000000002

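The coefficient of “sqft_basement” is NA because the variable is an exact linear combination of other predictors (sqft_living is the sum of sqft_above and sqft_basement), so R cannot estimate it separately; we therefore dropped it. The p-value of “bathrooms” is large (0.15, meaning the coefficient is not statistically significant), so we dropped this variable as well.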
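The NA coefficient can be reproduced on simulated data: when one predictor is an exact sum of two others, `lm()` cannot separate their effects and reports NA for the redundant one. The values below are made up; only the variable names mirror the dataset.

```r
set.seed(3)  # simulated data
sqft_above    <- round(runif(100, min = 500, max = 3000))
sqft_basement <- round(runif(100, min = 0, max = 1500))
sqft_living   <- sqft_above + sqft_basement   # exact linear combination
price         <- 250 * sqft_living + rnorm(100, sd = 50000)

# sqft_basement adds no information beyond sqft_living and sqft_above,
# so lm() leaves its coefficient undefined (NA).
fit <- lm(price ~ sqft_living + sqft_above + sqft_basement)
coef(fit)
```

Which variable ends up as NA depends on the order of the terms in the formula: `lm()` keeps the earlier columns and drops the one it finds redundant.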

## 
## Call:
## lm(formula = price ~ . - sqft_lot - yr_built - sqft_basement - 
##     bathrooms, data = h2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1711175  -142619   -21684   101736  4134675 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  81162.16    7839.61   10.35 < 0.0000000000000002 ***
## bedrooms    -63928.53    2410.05  -26.53 < 0.0000000000000002 ***
## sqft_living    343.98       4.42   77.89 < 0.0000000000000002 ***
## floors       17336.10    3938.78    4.40    0.000010807578329 ***
## sqft_above     -37.41       5.00   -7.49    0.000000000000074 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 257000 on 21591 degrees of freedom
## Multiple R-squared:  0.51,   Adjusted R-squared:  0.509 
## F-statistic: 5.61e+03 on 4 and 21591 DF,  p-value: <0.0000000000000002
##    bedrooms sqft_living      floors  sqft_above 
##        1.55        5.37        1.48        5.59

Every remaining coefficient is now significant. We also checked the VIF value of each variable, and none of them is very large (the largest is 5.59, for sqft_above), indicating no severe multicollinearity. We then added the two factor variables (“grade” and “condition”) into the dataset to see their effects.
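For reference, a variance inflation factor can be computed by hand as 1 / (1 - R²), where R² comes from regressing one predictor on all of the others. The sketch below does this in base R on simulated predictors; the readout above has the layout that `vif()` from the car package produces, but that is an assumption, and the hand computation is shown only to make the definition concrete.

```r
set.seed(5)  # simulated data
sqft_above  <- runif(300, min = 500, max = 4000)
sqft_living <- sqft_above + runif(300, min = 0, max = 1500)  # correlated
bedrooms    <- round(1 + sqft_living / 1000 + rnorm(300))
predictors  <- data.frame(sqft_above, sqft_living, bedrooms)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all of the other predictors.
vif_by_hand <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v),
                   data = predictors))$r.squared
  1 / (1 - r2)
})
round(vif_by_hand, 2)
```

A VIF of 1 means a predictor is uncorrelated with the others; the more of its variation the other predictors explain, the larger the value becomes.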

##      price            bedrooms      sqft_living      sqft_above  
##  Min.   :  78000   Min.   : 1.00   Min.   :  370   Min.   : 370  
##  1st Qu.: 322000   1st Qu.: 3.00   1st Qu.: 1430   1st Qu.:1190  
##  Median : 450000   Median : 3.00   Median : 1910   Median :1560  
##  Mean   : 540198   Mean   : 3.37   Mean   : 2080   Mean   :1789  
##  3rd Qu.: 645000   3rd Qu.: 4.00   3rd Qu.: 2550   3rd Qu.:2210  
##  Max.   :7700000   Max.   :11.00   Max.   :13540   Max.   :9410  
##                                                                  
##      floors         grade      condition
##  Min.   :1.00   7      :8973   1:   29  
##  1st Qu.:1.00   8      :6065   2:  170  
##  Median :1.50   9      :2615   3:14020  
##  Mean   :1.49   6      :2038   4: 5677  
##  3rd Qu.:2.00   10     :1134   5: 1700  
##  Max.   :3.50   11     : 399            
##                 (Other): 372
## 
## Call:
## lm(formula = price ~ ., data = h3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1571904  -122108   -22410    88448  4645573 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  119945.02  235012.20    0.51              0.60979    
## bedrooms     -27981.05    2255.58  -12.41 < 0.0000000000000002 ***
## sqft_living     229.90       4.37   52.66 < 0.0000000000000002 ***
## sqft_above      -89.75       4.65  -19.29 < 0.0000000000000002 ***
## floors        25574.69    3798.02    6.73       0.000000000017 ***
## grade4        52746.97  235247.78    0.22              0.82259    
## grade5        47680.72  231471.19    0.21              0.83680    
## grade6        77263.78  231054.85    0.33              0.73808    
## grade7       110396.50  231036.91    0.48              0.63278    
## grade8       183631.54  231071.92    0.79              0.42680    
## grade9       327826.56  231145.29    1.42              0.15613    
## grade10      530666.16  231258.19    2.29              0.02176 *  
## grade11      829031.75  231551.46    3.58              0.00034 ***
## grade12     1357863.56  232717.34    5.83       0.000000005462 ***
## grade13     2552196.47  240535.18   10.61 < 0.0000000000000002 ***
## condition2   -60791.74   46520.27   -1.31              0.19130    
## condition3   -62199.90   43243.71   -1.44              0.15035    
## condition4    -5587.43   43281.73   -0.13              0.89728    
## condition5    71583.51   43532.93    1.64              0.10012    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 231000 on 21577 degrees of freedom
## Multiple R-squared:  0.605,  Adjusted R-squared:  0.604 
## F-statistic: 1.83e+03 on 18 and 21577 DF,  p-value: <0.0000000000000002
##    bedrooms sqft_living  sqft_above      floors      grade4      grade5 
##        1.68        6.51        6.01        1.70       27.99      240.44 
##      grade6      grade7      grade8      grade9     grade10     grade11 
##     1847.93     5250.37     4367.68     2302.97     1077.66      393.79 
##     grade12     grade13  condition2  condition3  condition4  condition5 
##       90.02       14.10        6.85      172.49      147.02       55.66

None of the four “condition” dummy coefficients (levels 2 to 5, each compared with the baseline level 1) is significant at the 5% level, so we dropped the “condition” variable. For the “grade” variable, the higher levels (10 and above) have significant effects on price. By contrast, the lower levels (4 to 9) do not affect price significantly.
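A factor can also be judged as a whole rather than level by level, for example with a nested-model F-test. The sketch below illustrates the idea on synthetic data with a rare baseline level; it is not a test we ran on the housing data, and the data frame `toy` and its assumed effect sizes are made up for illustration.

```r
# Synthetic data (NOT the King County dataset): a numeric predictor plus a
# five-level factor whose baseline level is rare.
set.seed(2)
n <- 2000
toy <- data.frame(
  sqft_living = rnorm(n, mean = 2000, sd = 800),
  condition   = factor(sample(1:5, n, replace = TRUE,
                              prob = c(0.01, 0.02, 0.62, 0.27, 0.08)))
)
# Assumed effect: price rises with the condition level.
toy$price <- 100 * toy$sqft_living +
  20000 * as.integer(toy$condition) + rnorm(n, sd = 50000)

full    <- lm(price ~ sqft_living + condition, data = toy)
reduced <- lm(price ~ sqft_living, data = toy)

# Level-by-level t-tests: each dummy is compared with the baseline (level 1).
print(coef(summary(full)))

# Joint F-test: does the factor as a whole improve the fit?
joint <- anova(reduced, full)
print(joint)
p_joint <- joint[["Pr(>F)"]][2]
```

Because each dummy is measured against a baseline with few observations, the individual standard errors can be large even when the factor matters jointly, which is what the F-test is designed to pick up.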

## 
## Call:
## lm(formula = price ~ . - condition, data = h3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1594879  -122654   -26548    89616  4612348 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  203437.68  234053.01    0.87               0.3848    
## bedrooms     -25531.89    2283.29  -11.18 < 0.0000000000000002 ***
## sqft_living     241.42       4.40   54.90 < 0.0000000000000002 ***
## sqft_above     -101.49       4.69  -21.65 < 0.0000000000000002 ***
## floors        11331.97    3765.10    3.01               0.0026 ** 
## grade4       -58531.64  238316.99   -0.25               0.8060    
## grade5       -47771.75  234518.50   -0.20               0.8386    
## grade6       -24710.22  234099.87   -0.11               0.9159    
## grade7         2597.97  234076.33    0.01               0.9911    
## grade8        71562.72  234108.48    0.31               0.7598    
## grade9       212286.01  234180.94    0.91               0.3647    
## grade10      412631.72  234294.46    1.76               0.0782 .  
## grade11      707325.79  234587.29    3.02               0.0026 ** 
## grade12     1233712.59  235767.03    5.23           0.00000017 ***
## grade13     2416245.65  243679.96    9.92 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 234000 on 21581 degrees of freedom
## Multiple R-squared:  0.594,  Adjusted R-squared:  0.594 
## F-statistic: 2.26e+03 on 14 and 21581 DF,  p-value: <0.0000000000000002
##    bedrooms sqft_living  sqft_above      floors      grade4      grade5 
##        1.68        6.43        5.94        1.63       27.97      240.31 
##      grade6      grade7      grade8      grade9     grade10     grade11 
##     1846.94     5247.31     4365.02     2301.52     1076.98      393.53 
##     grade12     grade13 
##       89.96       14.09

Next, we added interaction terms to the model to see whether the effect of one predictor on price depends on the value of another. We first included all six pairwise interactions among the four numeric predictors to see what would happen.
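As a sketch of the formula syntax involved, `(a + b + c + d)^2` in an R formula expands to all main effects plus all pairwise interactions, and `update()` removes terms afterwards. The example uses synthetic data; the data frame `toy` and the two interactions dropped at the end are arbitrary choices for illustration, not the ones from our analysis.

```r
# Synthetic data (NOT the King County dataset).
set.seed(3)
n <- 300
toy <- data.frame(
  bedrooms    = sample(1:5, n, replace = TRUE),
  sqft_living = rnorm(n, 2000, 600),
  sqft_above  = rnorm(n, 1700, 500),
  floors      = sample(c(1, 1.5, 2), n, replace = TRUE)
)
toy$price <- 200 * toy$sqft_living +
  0.02 * toy$sqft_living * toy$sqft_above + rnorm(n, sd = 50000)

# (a + b + c + d)^2 expands to 4 main effects plus 6 pairwise interactions.
fit_all <- lm(price ~ (bedrooms + sqft_living + sqft_above + floors)^2,
              data = toy)
terms_all <- attr(terms(fit_all), "term.labels")
print(terms_all)

# update() drops terms without retyping the whole formula.
fit_small <- update(fit_all, . ~ . - bedrooms:floors - sqft_living:floors)
terms_small <- attr(terms(fit_small), "term.labels")
print(terms_small)
```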

## 
## Call:
## lm(formula = price ~ . + bedrooms:sqft_living + bedrooms:floors + 
##     bedrooms:sqft_above + sqft_living:floors + sqft_living:sqft_above + 
##     floors:sqft_above, data = h4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3695612  -119737   -25450    86085  3346184 
## 
## Coefficients:
##                             Estimate    Std. Error t value
## (Intercept)             245214.63848  229404.26957    1.07
## bedrooms                -51836.57174    7294.93923   -7.11
## sqft_living                 86.13585      16.53152    5.21
## sqft_above                 -19.70424      21.95986   -0.90
## floors                   36232.45799   14886.82743    2.43
## grade4                  -45196.54696  233098.37846   -0.19
## grade5                  -15410.52348  229421.70528   -0.07
## grade6                   21127.44809  229036.61201    0.09
## grade7                   74426.60451  229048.12877    0.32
## grade8                  160884.19715  229093.63765    0.70
## grade9                  310355.78334  229162.33811    1.35
## grade10                 492220.91881  229263.64399    2.15
## grade11                 721454.84294  229542.28824    3.14
## grade12                1105961.32670  230837.39838    4.79
## grade13                1837445.26776  239796.69406    7.66
## bedrooms:sqft_living       -16.02514       3.80857   -4.21
## bedrooms:floors          25728.79001    5273.85824    4.88
## bedrooms:sqft_above         20.28226       5.16049    3.93
## sqft_living:floors         101.59363       8.63259   11.77
## sqft_living:sqft_above       0.03473       0.00195   17.79
## sqft_above:floors         -177.53444       9.78398  -18.15
##                                    Pr(>|t|)    
## (Intercept)                          0.2851    
## bedrooms                  0.000000000001233 ***
## sqft_living               0.000000190166631 ***
## sqft_above                           0.3696    
## floors                               0.0149 *  
## grade4                               0.8463    
## grade5                               0.9464    
## grade6                               0.9265    
## grade7                               0.7452    
## grade8                               0.4825    
## grade9                               0.1757    
## grade10                              0.0318 *  
## grade11                              0.0017 ** 
## grade12                   0.000001669853101 ***
## grade13                   0.000000000000019 ***
## bedrooms:sqft_living      0.000025908368174 ***
## bedrooms:floors           0.000001076292883 ***
## bedrooms:sqft_above       0.000085104651817 ***
## sqft_living:floors     < 0.0000000000000002 ***
## sqft_living:sqft_above < 0.0000000000000002 ***
## sqft_above:floors      < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 229000 on 21575 degrees of freedom
## Multiple R-squared:  0.612,  Adjusted R-squared:  0.611 
## F-statistic: 1.7e+03 on 20 and 21575 DF,  p-value: <0.0000000000000002
##               bedrooms            sqft_living             sqft_above 
##                   17.9                   95.0                  136.2 
##                 floors                 grade4                 grade5 
##                   26.6                   28.0                  240.4 
##                 grade6                 grade7                 grade8 
##                 1848.1                 5252.3                 4369.7 
##                 grade9                grade10                grade11 
##                 2304.0                 1078.0                  393.9 
##                grade12                grade13   bedrooms:sqft_living 
##                   90.2                   14.3                  144.4 
##        bedrooms:floors    bedrooms:sqft_above     sqft_living:floors 
##                   70.0                  194.0                  150.2 
## sqft_living:sqft_above      sqft_above:floors 
##                   32.1                  173.0

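We dropped the insignificant interactions, as well as interactions that caused other variables to become insignificant (in the full interaction model above, sqft_above was no longer significant). The model below keeps two interactions, bedrooms:sqft_above and sqft_living:sqft_above. All numeric terms and both interactions are significant, and the adjusted R-squared is 0.605.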

## 
## Call:
## lm(formula = price ~ . + bedrooms:sqft_above + sqft_living:sqft_above, 
##     data = h4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3552516  -119684   -27362    86763  3536811 
## 
## Coefficients:
##                             Estimate    Std. Error t value
## (Intercept)             301536.38692  230960.70303    1.31
## bedrooms                -37942.22923    5140.56726   -7.38
## sqft_living                174.31529       5.84404   29.83
## sqft_above                -232.12312       9.01099  -25.76
## floors                   14271.05585    3722.97985    3.83
## grade4                  -35043.35004  235079.61029   -0.15
## grade5                    9116.80210  231363.17493    0.04
## grade6                   50285.38874  230969.99192    0.22
## grade7                  107201.66962  230973.41942    0.46
## grade8                  195978.96124  231016.35221    0.85
## grade9                  342536.69129  231090.06081    1.48
## grade10                 523989.90325  231190.83681    2.27
## grade11                 751707.87220  231468.92597    3.25
## grade12                1141807.53068  232773.60628    4.91
## grade13                1933822.97431  241686.51914    8.00
## bedrooms:sqft_above         12.57344       2.48837    5.05
## sqft_living:sqft_above       0.02832       0.00169   16.75
##                                    Pr(>|t|)    
## (Intercept)                         0.19171    
## bedrooms                 0.0000000000001629 ***
## sqft_living            < 0.0000000000000002 ***
## sqft_above             < 0.0000000000000002 ***
## floors                              0.00013 ***
## grade4                              0.88150    
## grade5                              0.96857    
## grade6                              0.82765    
## grade7                              0.64256    
## grade8                              0.39626    
## grade9                              0.13828    
## grade10                             0.02343 *  
## grade11                             0.00117 ** 
## grade12                  0.0000009399814052 ***
## grade13                  0.0000000000000013 ***
## bedrooms:sqft_above      0.0000004387237711 ***
## sqft_living:sqft_above < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 231000 on 21579 degrees of freedom
## Multiple R-squared:  0.605,  Adjusted R-squared:  0.605 
## F-statistic: 2.07e+03 on 16 and 21579 DF,  p-value: <0.0000000000000002
##               bedrooms            sqft_living             sqft_above 
##                   8.75                  11.67                  22.55 
##                 floors                 grade4                 grade5 
##                   1.64                  27.97                 240.39 
##                 grade6                 grade7                 grade8 
##                1847.89                5251.23                4368.71 
##                 grade9                grade10                grade11 
##                2303.51                1077.80                 393.79 
##                grade12                grade13    bedrooms:sqft_above 
##                  90.13                  14.24                  44.35 
## sqft_living:sqft_above 
##                  23.68

5.1.2 Final results and approved model for prediction of price

explain coefficients and all that - use inline coding etc..

Price = 142000 + bedrooms × (-31710 + 10.72 × sqft_above) + sqft_living × (170 + 2.943 × sqft_above) + sqft_above × (-228.6) + floors × 14570 + grade()*
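With an interaction in the model, the effect of one more bedroom is not a single number: it depends on sqft_above. The sketch below works this out using the bedrooms and bedrooms:sqft_above estimates copied from the last model summary above. The helper name `bedroom_effect` is ours and purely illustrative.

```r
# Coefficients copied from the last model summary above (as printed).
b <- c(bedrooms = -37942.22923, `bedrooms:sqft_above` = 12.57344)

# With a bedrooms:sqft_above interaction, the change in predicted price per
# extra bedroom (other variables held fixed) depends on sqft_above.
bedroom_effect <- function(sqft_above) {
  unname(b["bedrooms"] + b["bedrooms:sqft_above"] * sqft_above)
}

# Evaluate at the quartiles of sqft_above from the data summary above.
print(bedroom_effect(c(1190, 1560, 2210)))

# Value of sqft_above at which the estimated bedroom effect changes sign.
break_even <- unname(-b["bedrooms"] / b["bedrooms:sqft_above"])
print(break_even)
```

At all three quartiles the estimated effect of an extra bedroom is negative, and it only turns positive for houses with roughly 3,000 square feet or more above ground.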

Problem: The price histogram above is quite right-skewed: a long tail of very expensive houses pulls the mean price ($540,198) well above the median ($450,000). When we built the model, we did not exclude these outliers because we considered them important. As a result, our final model is also somewhat skewed: for low-priced houses it may predict prices that are too high, and for high-priced houses it will tend to predict prices that are too low.
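A log transform of price is one common way to reduce this kind of skew. The sketch below shows the effect on synthetic log-normal "prices". The simulated values and the helper `skewness` are assumptions made for illustration, so the numbers say nothing about the actual dataset.

```r
# Synthetic right-skewed "prices" (log-normal; NOT the King County dataset).
set.seed(4)
price <- rlnorm(5000, meanlog = log(450000), sdlog = 0.5)

# Hypothetical helper: sample skewness (third standardized moment).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(price)       # clearly positive: long right tail
skew_log <- skewness(log(price))  # close to zero after the transform
print(c(raw = skew_raw, log = skew_log))

# In a right-skewed variable the mean sits above the median.
print(c(mean = mean(price), median = median(price)))
```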

6 Chapter 6: Conclusion

final insights, main relationships, predictors, what to look in a real estate dataset etc.. future openings for further studies, analysis, tests etc on this..

7 Bibliography